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1.  Introduction 

The  problem  that  a  relatively  simple  analysis  is  changed  into  a  complex  one 
just  because  some  of  the  information  is  missing,  is  one  which  faces  most  practicing 
statisticians  at  some  point  in  their  career.  Obviously  the  best  way  to  treat  missing 
information  problems  is  not  to  have  them.  Unfortunately  circumstances  arise 
in  which  information  is  missing  and  nothing  can  be  done  to  replace  it  for  one 
reason  or  another.  In  analogy  to  other  accidents — we  don’t  plan  on  accidents, 
nevertheless  they  do  occur  and  safety  measures  must  be  aimed  at  palliating 
consequences  as  well  as  at  prevention.  Consequently,  a  great  volume  of  literature 
has  been  produced,  dealing  with  a  number  of  specific  situations.  An  indication 
of  the  content  of  many  of  these  papers  is  given  in  the  Appendix.  In  this  paper  we 
propose  to  try  to  present  a  general  philosophy  for  dealing  with  the  problem  of 
missing  information,  and  to  give  a  method  which  will  lead  quite  easily  to  maxi¬ 
mum  likelihood  estimates  of  the  parameters  obtained  from  the  incomplete  data 
using  as  nearly  as  possible  the  same  techniques  as  if  the  data  were  all  present. 

Our  first  simple  use  of  the  missing  information  principle  resulted  from  a  con¬ 
versation  in  1946  between  Max  A.  Woodbury  and  C.  W.  Cotterman  resulting 
from  the  latter’s  interest  (Cotterman  [20] )  in  estimating  gene  frequencies  from 
phenotypic  frequency  data.  The  observation  was  made  that  if  one  has  the  geno¬ 
typic  frequencies  NAA,  NAB,  NBA,  NBB,  NAO,  NBO,  NOB,  and  NOO  of  red 
blood  cell  genotypes,  indicated  by  the  second  and  third  letters  of  each  symbol, 
then  the  gene  frequencies  are  easily  computed.  If  N  is  the  total  of  the  above 
frequencies  then  the  estimates  would  be 

pA  =  ^  {2NAA  +  NAB  +  NBA  A  NAO  A  NO  A), 

(1.1)  pB  =  (2 NBB  +  NBA  4-  NAB  4-  NBO  +  NOB), 

ZiDi 

p  =  —  (2  NOO  +  NO  A  A  NAO  A  NOB  +  NBO). 

2  N 
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However,  only  the  phenotypic  frequencies 

MA  =  NAA  +  N AO  +  NOA , 
(1.2)  MB  =  NBB  +  NBO  +  NOB , 

MO  =  NOO , 

MAB  =  NAB  +  NBA, 


are  available. 

If,  however,  one  makes  use  of  Bayes’  theorem  and  the  gene  frequencies  one 
can  obtain  estimates  of  the  genotypic  frequencies  from  the  phenotypic  frequencies 


NAA  =  MA 


A 


(1.3) 


NAO  +  NOA  =  2MA 
NBB  =  MB 


ip  A  +  %PaPo) 
Pa 


=  MA 


Pa 


ip  A  +  2 Po) 


ip  A  +  '*Po) 
Pb 


NBO  +  NOB  =  2  MB 


iPB  +  %Po)' 
Po 


iPB  +  2j»o) 


NAB  +  NBA  =  MAB , 
NOO  =  MO. 


If  one  solves  (1.1)  and  (1.3)  simultaneously  by  equating  the  genotypic  fre¬ 
quencies  in  (1.1)  to  their  estimates  in  (1.3),  one  can  obtain  estimates  pA,  pB, 
and  p0,  which  of  course  are  not  as  good  as  those  obtainable  from  the  true  geno¬ 
type  frequencies  but  which  are  as  efficient  as  the  maximum  likelihood  estimates 
based  only  on  the  phenotypic  frequencies. 

The  problem  with  estimating pA,pB,  and  pQ  by  this  method  is  the  difficulty  of 
finding  rapid  and  accurate  solutions  of  these  equations  and  estimates  of  their 
error  variances.  These  difficulties  are  shared  with  the  method  of  Maximum 
Likelihood  (ML).  This  is  not  too  surprising  since  in  fact  the  two  methods  are 
equivalent.  One  way  in  which  the  two  problems  of  slow  convergence  and  loss  of 
information  may  be  handled  is  by  the  method  of  scoring  which  can  be  modified 
to  work  in  the  presence  of  missing  information.  The  solution  of  the  problem  of 
estimating  gene  frequencies  in  more  general  circumstances  is  provided  by 
Ceppellini,  Siniscialco,  and  Smith  [15],  who  demonstrated  that  the  procedure 
implied  by  the  principle  indicated  above  is  in  fact  ML  in  all  cases.  These  authors 
also  considered  the  increased  variance  of  the  estimates  due  to  the  loss  of  geno¬ 
typic  information  under  the  heading  “hidden  variance.” 

This  missing  information  principle  has  been  applied  to  missing  observations 
in  a  linear  model,  and  in  a  multivariate  normal,  and  to  mixture  problems.  A 
few  examples  will  be  presented  later  in  order  to  demonstrate  the  relative  facility 
with  which  the  principle  may  be  applied. 
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2.  Theory 

The  method  proposed  is  to  regard  the  values  of  the  missing  data  as  random 
variables  within  the  framework  of  a  model  of  the  data.  Thus,  estimates  which 
are  well  defined  when  all  the  data  are  present  become  random  variables  (being 
functions  of  the  missing  data).  This  variation  of  the  estimates  is  in  addition  to 
the  usual  sampling  variation  so  that  the  error  variances  of  the  estimates  are  in¬ 
creased.  The  consequences  of  the  data’s  being  missing  and  some  insight  to  the 
approach  used  here  are  obtained  if  one  considers  replacing  the  missing  data  by 
sample  values  from  the  appropriate  distribution  function.  The  question  is,  from 
which  distribution  function  should  we  sample? 

In  the  independent,  identically  distributed  case  where  the  vector  x{  has  the 
distribution  function  f{x{\9)  and  xt  =  (Yt,  zf)  where  the  vector  zi  contains  the 
missing  components,  then /(a;,|d)  =  fi(yi\9)-f2{zi\9,yi)  is  the  factorization  of 
the  distribution  function  into  the  marginal  distribution  for  yt  and  the  conditional 
distribution  of  zf  given  yt.  The  proper  distribution  to  sample  for  the  missing  data 
then  is  the  conditional  distribution  /(z,|d,  «/,  ),  but  9  is  unknown  so  that  some 
estimated  value  9,  must  be  used.  One  could  draw  many  samples  from  the  distri¬ 
bution  /  and  from  these  completed  data  samples  obtain  the  distribution  of  the 
parameter  estimates  due  to  the  missing  data.  Call  this  distribution  MID(d,  Z). 
If  this  distribution  is  asymptotically  normal,  then  the  mean  will  be  the  obvious 
statistic  to  use  to  provide  an  estimate  in  the  presence  of  missing  data.  If  this 
mean  value  should  be  9,  then  the  estimate  has  not  been  affected  by  the  assumed 
missing  data  distribution.  That  is  the  missing  data  tells  you  nothing.  This  inter¬ 
pretation  of  the  principle  is  due  to  Jacquez.  The  remaining  part  of  the  missing 
information  principle  is  to  equate  the  mean  of  the  MID(0,  Z)  to  9,  or  take  some 
action  equivalent  to  this.  The  effect  of  the  variance  of  MID($,  Z)  (“the  hidden 
variances”)  on  the  error  variance  is  best  understood  in  another  context. 

Before  continuing  with  the  problem  of  missing  information,  it  will  be  of  value 
to  review  the  method  of  maximum  likelihood  estimation.  The  likelihood  function 
of  a  multivariate  data  matrix  X  is  denoted  by  L(X\9),  and  is  defined  to  be 
L(X\9)  =  n?=1  f(Xn\9),  where  f(Xn\9)  is  the  density  function  of  Xn  and  9  is  an 
(sxl)  vector  of  parameters.  The  Score  (Sco)  for  the  parameter  9  is  then  defined 
to  be 

(2.1)  Sco(0,|A)  =  ^>gL(X|0)  =  -  — £(A'  |0). 

The  maximum  likelihood  estimates  for  the  parameters  are  obtained  by  solving 
the  set  of  equations.  It  may  be  readily  deduced  that  the  mean  value  of  the  score 
is  zero  at  the  true  parameter  point;  that  is, 

(2.2) 


(2.3) 


Sco  (0j|X)  =  0  for  j  =  1,  •  •  •  ,  s, 
tf[Sco(fy|X)]  =  0. 
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The  information  matrix  for  9  is  defined  to  be  the  matrix  with  (j,  k)\h  element 

(2.4)  J(6„  e»|X)  =  -E  L^Sco  (9j|X) ] 

-  -£[4klogi<xH 

=  E[Sco(6j\X)  Sco  (9k |X)] 

=  Cov[Sco(0j|X),Sco(0k|X)] 

Under  certain  general  regularity  conditions  (see  Rao  [36] ,  p.  295),  for  example, 
the  existence  of  second  and  third  derivatives  of  the  log  likelihood  function,  it  may 
be  shown  that  the  joint  distribution  of  the  maximum  likelihood  estimators  is 
asymptotically  multivariate  normal  with  a  covariance  matrix  given  by  the  inverse 
of  the  information  matrix.  For  a  detailed  discussion  of  this  subject,  the  paper 
by  Wald  [41]  can  be  consulted.  A  condition  for  the  existence  of  maximum  likeli¬ 
hood  estimates  is  that  the  information  matrix  is  positive  definite.  However,  in 
some  instances,  such  as  in  the  case  of  the  multinomial  distribution,  the  informa¬ 
tion  matrix  has  rank  less  than  «s  and  is  therefore  singular.  If  the  information 
matrix  is  of  rank  s  —  t,  then  we  are  required  to  impose  t  restrictions,  hl(9)  — 
0,  •  •  •  ,  ht(0)  =  0,  on  the  parameters  in  order  to  achieve  identifiability. 

The  problem  of  maximum  likelihood  estimation  of  parameters  subject  to  con¬ 
straints  has  been  studied  by  Aitchison  and  Silvey  [5],  [6],  and  Silvey  [38]. 
These  authors  also  obtained  a  test  statistic  for  the  hypothesis  h(0)  =  0.  However, 
the  situation  of  interest  here  is  when  h(0)  is  a  t  x  1  vector  of  constraints  necessary 
to  produce  an  identifiable  parameter  set.  In  this  case  the  constrained  likelihood 
function  may  be  written  as 

(2.5)  L*  =  log  L(X  1 9)  -  lTh{9 ), 

where  A  is  a  t  x  1  vector  of  Lagrangian  multipliers.  If  we  define  the  constrained 
score 

(2.6)  Sco*(0|X)  =  ~ 
and 


(2.7) 


H 

<sx  r) 


\  Mj  r 


then  we  may  deduce  that  the  expected  value  of  the  constrained  score  is  zero  if 
the  true  parameter  satisfies  the  constraints. 

Considering  the  various  definitions  of  the  information  matrix  listed  in  (2.4), 
we  may  determine  that  a  nonsingular  matrix,  denoted  by  J*,  will  be  obtained 
by  taking  the  negative  of  the  expected  value  of  the  derivative  of  the  constrained 
scores.  However,  if  we  take  the  covariance  of  the  constrained  scores  we  will  call 
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the  singular  information  matrix  J.  This  is  related  to  the  required  nonsingular 
matrix  by  the  equation 

(2.8)  J*  =  J  +  NHHT, 

where  N  is  the  number  of  observations.  This  form  will  lead  to  the  asymptotic 
covariance  matrix  of  the  parameter  estimates  given  by 

(2.9)  V  =  (J*)-1  =  (J-1  - 

It  may  also  be  remarked  that  a  nonsingular  information  matrix  will  not  result 
simply  from  reparametrizing  so  that  the  new  parameters  satisfy  the  constraints, 
unless  some  are  eliminated. 

Let  us  now  return  to  the  situation  where  we  have  missing  information.  Suppose 
that  we  cannot  observe  the  random  variable  X,  but  can  instead  observe  some 
image  F ,  of  it.  The  likelihood  function  for  Y  may  be  obtained  by  integrating 
L(X\6)  over  the  appropriate  range,  and  thus  we  may  write 

(2.10)  L(X|0)  =  Lj(X|  Y.  0)  L2(Y\0) 
giving 

1  3  Tj 

(2.11)  Sco(0,|X)  =  — +  Sco(0j|F). 

The  item  (1  /Lx)  (dLJdOj)  is  called  the  conditional  score  of  Oj  from  X,  given  Y, 
and  this  is  denoted  by  Sco  ( 6j ,  X|  F).  It  may  be  noted  that 

(2.12)  tf[Sco(0,,X|F)|F]  =  0, 

the  truth  of  this  following  from  the  same  reasoning  as  was  used  to  establish  (2.3). 
Finally  we  have 

(2.13)  Sco  (6j\  Y)  =  £[Sco(0,.|X)|F] 
and 

(2.14)  ^[Sco(0j|X)Sco(0k|X)]  =  Cov{^[Sco(0j|X)|F],^[Sco(0k|X)|r]} 

+  E{ Cov  [Sco(0j|X),  Sco(0k|X)|F]} 

which  leads  to 

(2.15)  E[ Sco  (dj  |  X)  Sco  (0k  \X)]  =  ^[Sco  (0,- 1  F)  Sco  (0k  |  F)] 

+  E{ Cov  [Sco  (0j\X),  Sco  (0k|X)|  F]}. 

For  the  information  matrix  this  gives 

(2.16)  J(6,  0|X)  =  J(d,  0|  F)  +  J(9,  0;  F|X). 

The  last  quantity  on  the  right  J(6,  0 ;  F  |  X )  is  what  is  termed  the  lost  information. 
For  brevity,  equation  (2.16)  may  be  written  Jx  =  JY  +  Jx/y>  where  JX/y  =  Jx  ~ 
JY  =  the  lost  information. 
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We  may  now  easily  obtain  a  relationship  between  the  lost  information  and 
the  increase  in  variance  of  the  parameter  estimates  (the  “hidden  variance"  of 
Ceppellini,  Siniscialco.  and  Smith  [15]  ).  This  relationship  derives  from 

(2.17)  1  ~  ^x1)  —  (^x~  Jy)Jy  1 
which  we  may  write  as 

(2.18)  J y  —  Jx  1  =  Jx  1  {Jx  ~  Jy)Jy 

(2.19)  Jx  -  JY  =  Jx{Jyl  ~  Jx1)Jy- 

If,  for  simplicity,  we  write  A  ^  B  when  the  matrix  difference.  C  =  A  —  B.  is 
positive  semidefinite,  then  we  have 

(2.20)  JyX^Jx\ 

(2.21)  Jx  ^  JY. 

We  may  now  obtain  bounds  on  the  hidden  variance  in  terms  of  the  lost  inform¬ 
ation.  and  vice  versa.  These  are 


(2.22) 

(2.23) 


j-  i  j 
J  X  JX\ Y 

Jy(Jyl 


Jx  1  ^  (Jy1  -  J X 
J  x  1  )Jy  =  J  x  |  y 


1  \  <  /-  i  /  /-  i 

/  =  JY  J  X\ YJY  ■ 


These  may  be  of  value  in  practical  situations  where  some  of  the  quantities  are 
easier  to  obtain  than  others.  The  widths  of  these  limits  depend  on  the  amount  of 
missing  information  and  are  “tight”  when  this  is  small. 

The  usefulness  of  the  above  theory,  and  in  particular  result  (2.13),  in  estimating 
parameters  in  the  presence  of  missing  information  is  that  it  is  often  quite  easy  to 
obtain  the  right  side  even  in  those  cases  when  it  is  extremely  difficult  to  obtain 
the  left  side. 


3.  Examples 


3.1.  Example  1.  Consider  first  the  case  of  missing  observations  in  a  linear 
model.  Suppose  that  Y  is  a  set  of  independent,  normally  distributed,  random 
variables  having  a  common  variance  of  a2,  and  a  mean  of  X9,  where  X  is  an 
n  x  k  design  matrix  and  9  is  a  k  x  1  vector  of  unknown  parameters.  Then  we 
have 

XT(Y  -  X9) 

(3.1.1)  Sco(0|F)  =  — - — j - -, 


(3.1.2) 


Sco  (a2 1  Y) 


n  (Y  -  X9)T(Y  -  X9) 
2<?  2a* 


Equating  these  to  zero  gives 

(3.1.3)  9  =  (XtX)~1XtY, 

a2  =  i  (Y  -  X9)t(Y  -  X9) 


(3.1.4) 
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This  estimator  for  a1  is,  of  course,  well  known  to  be  biased,  the  unbiased  esti¬ 
mator  being 


(3.1.5) 


(F  -  X0)T(Y  -  X6) 
n  —  k 


where  k  is  the  rank  of  the  design  matrix  X. 

Suppose  that  there  are  n  potential  observations  but  that  there  are  m  missing, 
leading  to  an  image  of  F  being  the  vector  F0  of  observed  values.  Due  to  the 
assumed  independence  of  the  observations  the  conditional  expectation  of  the 
scores  may  be  easily  computed.  Also  the  observed  values  are  unaltered  whilst 
functions  of  the  missing  values  are  replaced  by  their  expected  values.  Hence 


(3.1.6) 


Sco(0|Fo)  =  ±jXT(Y0  -  XmQ  -  X9) 


(3.1.7)  Sco  (cr2 1  F0)  =  5-3  “ 


(Fn  -  X09)T(Y0  -  Xo0 ) 
2a2 


here  X  and  X0  are  (n  x  k)  matrices  such  that  X  =  Xm  +  X0,  and  Ym  and  F0 
are  (n  x  1)  vectors  such  that  F  =  Fm  +  F0.  The  following  estimators  are  thus 

obtained : 


(3.1.8) 

(3.1.9) 

(3.1.10) 


9  =  (XTX)~1XTY, 

Y  =  Yo  + 

ir_,y  x  8)T  <r»  - 

a  -  (r0  X0U)  n_m_k 

n  —  m  —  k 


It  may  be  noted  that  9  can  be  eliminated  between  (3.1.8)  and  (3.1.9)  to  obtain 

(3.1.11)  Ym  =  [/  —  Xm(XTX)-'XZY'Xm(XTX)-'XlY0. 

This  equation  provides  a  form  of  estimating  missing  data  which  is  easy  to  com¬ 
pute  for  a  single  value.  It  is  proposed  to  use  this  simple  form  to  obtain  an  initial 
set  of  estimates  for  the  missing  values,  and  then  to  cycle  iteratively  throug 
equations  (3.1.8)  and  (3.1.9)  until  the  parameter  estimates  stabilize.  This  is  the 
modification  of  Yates’  [50]  approach  to  the  problem ,  as  proposed  by  Tocher  L39J . 
The  lost  information  for  estimating  0  may  be  shown  to  be  {XmXm)a  ,  thus 

the  covariance  matrix  for  §  may  be  written 

(3.1.12)  Cov(0,  6)  =  (XTX  -  XlXm)  la2  _  t 

=  {(XTX)_1  +  (XTX)  1Xlt[I  -  Xm(X  X)  xm]  Xm 

(XTX)->2. 
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Therefore  the  quantity  to  be  added  to  Cov  ( 9 ,  §)  so  as  to  correct  for  “bias”  is 

(3.1.13)  B  =  {(XTX)-‘X£[/  -  Xm(X7'X)-'Xj]-‘Xm(XrX)-‘}ff2. 

Since  we  recover  none  of  the  lost  information,  in  general  the  main  reason  for 
using  the  procedure  proposed  here  is  that  the  design  matrix  X  for  the  complete 
data,  where  we  have  a  balanced  design,  may  be  chosen  such  that  XTX  is  diagonal, 
and  hence  much  easier  to  invert  than  the  general  matrix  XqX0  which  would 
result  from  using  the  available  data  only.  However,  it  should  be  noted  that  in 
order  to  correct  for  the  bias  in  Var  (9),  it  is  necessary  to  invert  the  matrix 
[/  —  XZl(XTX)~1  Xm].  This  is  of  order  m  and  generally  has  a  regular  pattern,  but 
it  may  not  be  diagonal.  It  is  therefore  felt  that  if  the  number  of  missing  values  m 
equals  the  rank  of  the  design  matrix  then  there  is  less  to  be  gained  from  using  the 
procedure  outlined  here. 

3.2.  Example  2.  A  second  example,  which  has  received  considerable  atten¬ 
tion,  is  that  of  missing  components  in  a  multivariate  normal  distribution.  This 
has  been  considered  by  Woodbury  and  Hasselblad  [47]  but  it  is  being  included 
here  due  to  its  great  interest.  The  log  of  the  likelihood  function,  for  a  sample  of 
size  N,  may  be  written 

(3.2.1)  log  L(X\p,  2)  =  C  +  -  jN  tr  'L~1S, 

where  S  =  E*=  j  (Xn  -  p)  (Xn  -  p)T/N. 

The  parameter  scores  are  easily  obtained  and  are 

(3.2.2)  Sco  (p\X)  =  iVZ_1(X  -  p) 
and 

(3.2.3)  Sco(2;|X)  =  -  %N(L~X  -  ST^ST1)- 

By  equating  these  to  zero  we  can  obtain  the  parameter  estimates 

(3.2.4)  t  =  S 
and 

(3.2.5) 

n=  1  jy 

Equation  (3.2.3)  is  obtained  by  observing  that  the  derivative  of  the  logarithm  of 
a  determinant  of  a  matrix  with  respect  to  an  element  of  that  matrix  is  simply  the 
corresponding  element  of  the  inverse,  and  that  52  _1  =  —  E_1(5Z)2_1.  We 
also  used  such  properties  as  tr  (AB)  =  tr  (BA),  and  tr  (a)  =  a  if  a  is  a  scalar. 
Additionally  we  note  that  <jl,j  =  a*’  \  although  this  fact  is  not  used  in  obtaining 
Sco  (a ij).  The  solution  of  the  normal  equations  will  not  be  affected  since  we 
obviously  have  Sco  (oiti)  =  Sco  (<rJ  t).  The  reason  for  forming  the  score  of  the 
covariance  matrix  instead  of  the  score  of  the  inverse,  as  is  more  normal,  is  that 
this  will  greatly  simplify  the  computation  of  the  information  for  £,  as  will  be 
developed  later. 
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The  image  Y  of  X  consists  of  the  observed  components  of  the  data  matrix  X. 
If  we  assume  that  there  are  p  components  then  an  individual  observation  is  a 
p  x  I  vector  which  may  be  written  in  the  form 

(3.2.6)  Yk  =  Yk<0  +  Ykm, 


where  Yk  0  is  the  observed  portion,  with  zero  in  each  position  corresponding  to  a 
missing  component,  and  Yk  m  is  the  estimated  missing  portion,  with  again  zero 
in  the  positions  corresponding  to  an  observed  component.  It  should  of  course 
be  remarked  that  if  there  are  no  missing  components  then  Yk  0  constitutes  the 
entire  vector.  To  obtain  the  conditional  expectation  of  the  scores  for  the  mean 
we  have  to  solve  the  regression  of  the  missing  data  Yk  m  on  the  observed  data 
Yk  0.  Similarly  the  conditional  expectation  of  the  scores  for  the  covariance 
matrix  requires  the  conditional  covariance  matrix  of  the  missing  data  given  the 
observed  data.  The  following  estimators  are  obtained : 


(3.2.7)  H  =  i  £  K„, 

^  11=1 

(3.2.8)  t  =  \  £  [(f„  -  fi)(Yn  -  p)T  +  Vn], 

iV  „=i 

(3.2.9)  Ykm  —  fim  +  2m>0  2-o,  0(^,0  —  fio) 

=  pm  +  (S"*’M)-15:M’0(rk>0  -  to), 

where  is  ap  x  p  matrix,  for  the  rath  observation,  with  £m>m  -  Smio2o,oso,m> 
(=  (Xm’m)-1),  in  the  positions  corresponding  to  the  missing  components  and 
zero  elsewhere.  It  should  be  noted  that  partitions  such  as  ( Yk>m ,  Yk>0)  vary  from 
observation  to  observation  depending  on  which  components,  if  any,  are 
missing. 

The  lost  information  for  the  mean  due  to  a  single  observation  may  be  shown 
to  be 


(3.2.10) 


LI  in)  = 


This  may  be  deduced  intuitively  since  we  would  lose  Z  _  1  by  discarding  a  com¬ 
plete  observation,  whereas  the  procedure  described  here  recovers  the  amount 
£q  q  contained  in  the  observed  portion.  Once  again  the  components  of  Z0  0 
vary  from  observation  to  observation  and  will  be  the  entire  matrix  if  no  com¬ 
ponent  is  missing. 

Since  there  are  only  q  =  %p{p  -1-  1)  distinct  elements  in  the  covariance  matrix, 
/(Z,  Z)  will  be  a  q  x  q  matrix.  However,  it  is  proposed  to  regard  it  as  being 
composed  of  p2  distinct  elements  and  then  to  gather  the  terms  in  the  p2  x  p2 
information  matrix  /*( Z,  Z)  to  give  the  required  J (Z,  Z).  Thus  we  have 
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(3.2.11) 


N2 

J*(L,  Z)  =  ^-£[Z_1  ®  Z_1  -  I"1  <g>  Z^SZ"1 

-  Z-^SZ-1  ®  Z_1  +  Z'^Z"1  ®  Z'^Z"1] 
=  ®Z-1)(S®fi,)(Z"1  0Z-1)  -  Z'1  (g)!"1] 


A  typical  element  in  E(S  ®  S)  may  be  shown  to  be 


(3.2.12) 


E(si,jsu,v)  —  Y  \.ai,jffu,v  +  ai,uaj,v  +  ai,vffj,u] 


Hence  (3.2.11)  reduces  to 

(3.2.13)  /*(Z.  Z)  =  j(Z_1  ®  Z_1)((7 UuoV'j  +  aitVaUtj)(L~x  ®  Z_1) 

=  -{oi'uav'j  +  ai'v<juJ). 

4 

To  obtain  these  equations  it  is  necessary  to  use  one  of  the  properties  of  the 
Kronecker  product,  namely,  ( A  ®  B)(C  ®  D)  —  AC  ®  BD.  To  obtain  /( Z.  Z) 
we  simply  collapse  this  matrix  noting  that  /(<t,-  ou  u)  consists  of  one  element 
like  that  shown  on  the  right  side  of  (3.2.13).  J{(Titi.  cu  v)  consists  of  the  sum  of 
two  such  elements,  and  ou<v)  consists  of  the  sum  of  four  such  elements. 

The  point  of  obtaining  /*( Z.  Z)  is  that  it  provides  a  means  of  computing  the 
information  matrix  in  the  presence  of  missing  data  since  J*  is  the  sum  of  the 
corresponding  matrices  for  each  observation  vector,  which  may  differ  from 
observation  to  observation  in  the  case  of  missing  data.  If  we  write  each  observa¬ 
tion  in  the  form  Yk  =  Yk  0  +  Yk  m  then  Z0  0  is  the  submatrix  of  Z  correspond¬ 
ing  to  Yk  0.  The  retained  information  is  then  obtained  by  accumulating  those 
elements  of  (3.2.13)  that  correspond  to  Zq.o-  Once  this  has  been  done  then  / 
can  be  obtained  by  combining  terms  and  from  this  the  lost  information  may  be 
computed,  as  can  the  covariance  matrix  of  the  estimated  covariance  matrix, 
covariance  matrix  of  the  estimated  covariance  matrix. 

Computationally,  the  procedure  followed  is  to  group  the  observations  into 
classes  of  identical  patterns  of  missing  and  observed  components.  Initial  esti¬ 
mates  of  the  mean  and  covariance  matrix  are  obtained  using  (3.2.4)  and  (3.2.5) 
on  the  complete  vectors,  if  there  are  any.  Then  (3.2.9)  is  used  to  get  initial 
estimates  of  the  missing  values  and  the  completed  data  used  in  (3.2.4)  and  (3.2.5) 
to  get  new  estimates  of  the  parameters.  Finally  the  covariance  matrix  is  corrected 
for  bias  by  adding  quantities  like  (Zm’m)_1  as  indicated  by  (3.2.8).  If  there  are 
no  complete  vectors,  it  is  proposed  to  use  some  good  initial  guess  of  the  missing 
data  and  then  to  start  the  cycle  with  (3.2.4)  and  (3.2.5)  as  before.  It  should  be 
noted  that  the  theory  of  partitioned  matrices  is  quite  useful  in  reducing  the 
amount  of  computation  since  the  inverse  of  the  quite  large  matrix  Z0  0  can  be 
easily  expressed  in  terms  of  the  submatrices  of  Z-1,  as  can  Zm  0Zq  q.  (See 
Woodbury,  M.A.,  “Inverting  modified  matrices,”  Stat.  Res.  Group.  Princeton, 
N.J.  Memo.,  Vol.  42,  1950.) 
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The  convergence  is,  in  certain  cases,  quite  slow  and  hence  methods  of  speeding 
it  must  be  used.  It  must  also  be  remembered  that  the  correct  procedure  to  follow 
for  any  analysis,  such  as  multiple  regression,  is  to  use  the  corrected  covariance 
matrix  and  the  mean  as  data,  since  using  only  the  values  predicted  by  (3.2.9)  will 
give  rise  to  a  biased  covariance  matrix.  The  necessary  theory  is  quite  easy  to  work 
out  and  computer  programs  have  been  written.  The  available  information  for 
estimating  the  mean  is  most  easily  obtained  by  accumulating  the  portion  Zq  q 
contained  in  the  observed  components. 

3.3.  Example  3.  Mixture  problems  may  be  regarded  as  missing  information 
problems  by  noting  that  the  indicator  variable  Z„  k  (which  is  1  if  the  wth  observa¬ 
tion  is  in  the  fcth  class  and  zero  otherwise)  is  missing.  If  this  was  available  then  the 
constrained  log  likelihood  would  be 

(3.3.1)  L  =  £  Z  Zn,kllogpk  +  fk(xn |0*)]  -  n(  £  pk  -  1 

n= 1  k=l  \k= 1 

from  this  we  may  obtain  the  score  for  pk  as 

(3.3.2)  Sco(A|*„,  Z„)  =  Sco  {pi\Z„)  =  £  ^  -  N, 

n=  1  Pk 

whilst  the  score  for  0k  is 

(3.3.3)  Sco  (0, |x„,  Z„)  =  £  Z„,t  j  ^-/s(*„ |0»). 


If  there  is  no  missing  data  these  equations  may  be  separated  into  K  classes,  one 
set  for  each  class. 

If  we  do  not  observe  the  Zn  k,  then  the  image  of  (x„,  Znl,  •  •  •  ,  Znk)  is  just  x„ 
and  hence  we  must  find  E(Z„  k\x„)  which  is  the  posterior  probability 


(3.3.4) 


P[k\x„~\ 


Pjk{Xn\0k) 

f{*n  I#) 


where 


(3.3.5) 


f(*n\0)  =  Z 
k=  1 


The  image  scores  are 

(3.3.6)  Sco(Pl|Z)=  £^W_iv 

n=  1  Pk 


Sco(0t|X)=  |  i>[*|x„]i^/1(x„|0t) 
=  Y,  -P[*|x„]  Scot  (0j|x„). 

n=  1 


and 

(3.3.7) 
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The  following  estimating  equations  are  obtained 
(3.3.8)  Nk  =  £  />[%.] 

n  =  1 

and 


(3.3.9) 


The  information  computations  require  expressions  for  the  expected  values  of  the 
second  derivative  of  the  likelihood  functions.  Although  these  expressions  do 
not  have  a  form  which  can  be  easily  evaluated  for  the  mixture  of  normals,  or 
any  other  standard  distribution,  they  are  being  recorded  here  for  the  sake  of 
completeness : 


(3.3.10) 


82L 

dpidpj 


1 

PiPj 


N 


z 


p[*|  XdP\_j\Xn\ 


(3.3.11) 


J*(PiPj)  =  N 


f 


fMMMMlox  -NJ*  W) 

f(x\0)  J* -««<•>. 


where  is  the  above  integral, 

(3.3.12) 


-  Seri  niMPUM  Sco  (0j|x")’ 


(3.3.13) 

(3.3.14) 

(3.3.15) 


J*(Pi »  Oj)  =  Npi 


dJjjO) 

dd; 


d2L  N 

=  Z  ^>[*K]-pMa?i.]scol(0J|a?ll)Scoj(0i|a?l,), 


dOidOj  « = i 
Oj)  =  Np^j 


d2Jfj(0) 

dOidOj 


The  overall  nonsingular  (unconstrained)  information  matrix  is 


(3.3.16)  J*  = 

and  its  inverse  is  V*.  We  note  that 


J*(p,p)  J*(p,6) 

| _J*{9,p)  J*(d,d)_ 


(3.3.17) 
so  that 


/* 


y* 


M 

_el 

L°J 

_oJ 

re_ 

\p~ 

k 

LoJ 

(3.3.18) 
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and 

(3.3.19) 


l> 


=  1 


where  e  is  a  vector  of  ones.  Thus  the  final  covariance  matrix  of  the  estimator  is 


(3.3.20) 


y* 


V*(p.  p)  —  ppT 

v*(0,p) 


v*(P,  e ) 

V*(0 ,  0) 


The  properties  of  J*  and  V*  discussed  in  the  papers  of  Aitchison  and  Silvey 
[5],  [6]  on  constrained  maximum  likelihood  estimation  are  also  shared  by  the 
approximating  sums  of  partial  derivatives. 

3.4.  Example  4.  Consider  now  the  problem  of  estimating  the  parameters  of 
a  mixture  of  multivariate  normal  populations  assumed  to  have  equal  covariance 
matrices  but  different  means.  Suppose  that  there  are  K  populations  and  that 
we  sample  N  observations.  Thus,  the  data  matrix  X  would  have  consisted  of  K 
submatrices,  of  Nk  observations  from  population  k  for  k  =  1,  •  •  •  ,  K.  Instead, 
we  have  the  image  Y  which  consists  of  N  observations  from  the  mixture.  If  we 
had  considered  Y  to  consist  of  a  single  population  then  we  would  have  obtained 

(3.4.1)  Sco(/i|F)  =  Z-‘  £  <r„  -  ji) 

n=  1 

leading  to 

(3.4.2)  a  =  I  f. 

n-  1 

and 

(3.4.3)  Sco(X-‘|7)  =  +  i  £  (¥,-  ?)(Yn  -  7)T 

n=  1 

leading  to 

(3-4.4)  ^  =  1  £  (Yn  -  THY,  -  Y)T. 

n=  1 

However,  regarding  Fas  a  mixture,  we  may  write  the  likelihood  function  as 


(3.4.5) 

(3.4.6) 


HY\h,y.)  =  X 


Z  Pkfk(x„) 


n  =  1  ]_k=  1 


log  L  =  £  log 


X  pk(2n)-<”2\T.\~'i2 


k=  1 


exp  {-  %{Xn  -  pk)TIk  l(X„  -  pk)} 


Np  N  I,, 

—  log  271  +  -log|S  x| 


N 

+  Z  log 

n=  1 


Z  Pk  eXP  -  VkfZ  ,)} 

fc=  1 
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Due  to  the  imposed  restriction,  that  Zf=  lpk  =  1,  we  are  required  to  maximize 
the  function 


(3.4.7) 


L*  =  log  L  -  X  pk  -  l^j. 


where  k  is  a  Lagrangian  multiplier. 

Defining  Sco  (d|X)  =  ( dL*/dX )  log  L*  we  obtain 


(3.4.8) 
where 

(3.4.9) 


n=  1  J  (An) 


f(Xn)  =  £  Pkfk(Xn)- 
k=  1 


Multiplying  by  pk  and  summing  over  k  gives  k  =  N,  hence 
(3.4.10)  y  *4^  =  N  for  k  =  ,  K. 


n?l  /(*„) 

If  we  introduce  the  posterior  probability 


(3.4.11) 

and  define 

(3.4.12) 
we  obtain 

(3.4.13) 
and 

(3.4.14) 
Finally 

(3.4.15) 


pdk(Xn) 


/(X„) 

A’„  =  Z  r[fc|x.], 


N 

Z 

n  =  1 


Pk  = 


Nk 

N 


h  =  Z  P[k  |X.]£. 

n  =  1  A  k 


i,.j  =  Z  Z  4p[fc|x„](x..,  -  Ai.i)(x„,y  -  Aj)- 

n  =  1  fc  =  1  iV 


If  we  call  this  estimate  of  Z  the  within  class  covariance  matrix  and  denote  it  by 
Z(w),  then  we  may  write 


(3.4.16)  -tf  P[fc|X„](X„  -  p  +  fL  -  pk)(X„  -ii  +  ji-  p  J 

n  =  1  k=  1 
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=  Z  Z  PWXJ[(X«  “  ^)(Xn  -  V)T  +  (XH  -  H)(V-  llk)T 

n= 1  k= 1 

+  (P  ~  Pk)(%n  ~  Pf  +  (P  ~  MkXP  ~  flk)T] 

=  jr  (Xn-  p)(xn  -  pf 

n=  1 

“II  P\Jc  |  Xn]  (jl  -  pk)(fi  -  iik)T. 

n  ~  1  k  =  1 

Thus  we  may  write 

(3.4.17)  S(vv)  =  S(T)  -  I(B). 

where  Z(T)  is  the  total  variance  and  E<B)  is  the  between  population  variance. 

The  computational  procedure  proposed  is  to  take  some  good  guess  as  an 
initial  set  of  estimates  for  the  parameters  pk,  pk,  £.  Then  to  cycle  iteratively 
through  (3.4.14),  (3.4.15),  and  (3.4.13)  until  the  parameter  estimates  stabilize. 
At  each  stage  we  use  the  best  parameter  estimates  available.  As  is  usual  with 
such  iterative  procedures  the  convergence  may  be  slow  and  will  require  speeding 
in  practice. 


APPENDIX 

Brief  Review  of  the  Literature  and  Historical  Development 

Although  considered  earlier  by  Allan  and  Wishart  [7],  the  first  general 
approach  to  the  problem  of  missing  data,  in  field  experiments,  was  that  of  Yates 
[50],  who  provided  formulae  enabling  the  least  squares  estimate  of  a  single 
missing  datum  to  be  computed  from  the  row  and  column  sums.  He  also  provided 
a  similar  formula  to  correct  for  the  bias  introduced  into  the  sums  of  squares, 
and  suggested  that  the  estimation  formula  could  be  used  iteratively  if  there  was 
more  than  one  missing  datum.  However,  no  general  correction  for  the  bias  in 
the  sums  of  squares  was  given.  The  basic  ideas  behind  the  approach  of  Yates 
have  been  used  by  many  authors  since  that  time  to  cover  most  common  linear 
models  (see  code  Ul).  It  has  also  been  used  for  a  factor  analytic  model  (see 
Woodbury,  Clelland,  and  Hickey  [46],  Woodbury  and  Siler  [48]),  but  it  is 
restricted  to  the  univariate  case  with  independent  observations.  A  second 
approach  to  the  problem  was  the  “covariance  method”  (code  U2)  of  Bartlett 
[11]  ,  (see  also  Coons  [17])  which  results  in  the  same  estimates  as  Yates  but  which 
also  gives  unbiased  sums  of  squares.  Tocher  [39]  described  a  method  (code  U3) 
which  appears  to  be  a  combination  of  the  approaches  of  Yates  and  Bartlett. 
This  method  gives  biased  sums  of  squares  but  a  general  correction  is  given.  A 
final  method  (code  U4)  for  univariate  random  variables  is  the  iterative  maximum 
likelihood  method  described  by  Hartley  [28].  This  involves  replacing  the  missing 
values  by  their  expected  values,  given  the  model  and  the  parameters.  Examples 
were  given,  by  Hartley,  for  a  number  of  discrete  distributions  having  sufficient 
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statistics  and  for  which  the  maximum  likelihood  parameter  scores  were  linear  in 
the  observations. 

The  multivariate  normal  has  been  extensively  studied ;  the  first  approach  being 
the  direct  application  of  maximum  likelihood  (code  Ml).  The  first  step  was 
made  by  Wilks  [45]  who  considered  the  general  bivariate  case.  Special  tri variate 
cases  were  considered  by  Lord  [34]  and  Edgett  [23].  Their  approaches  were 
greatly  simplified  by  Anderson  [9]  who  considered  the  general  “nested”  case  for 
which  the  likelihood  function  could  be  factored.  Trawinski  and  Bargmann  [40] 
used  the  notation  of  Roy’s  general  linear  model  to  write  the  likelihood  equations, 
and  gave  an  example  of  a  trivariate  case  for  which  each  type  of  incomplete  vector 
was  observed  an  equal  number  of  times.  Another  approach  to  the  problem  of 
testing  hypotheses  in  such  situations,  where  the  components  are  missing  by 
design  rather  than  accident,  is  given  by  Kleinbaum  [31].  Hocking  and  Smith 
[29]  approached  the  problem  by  sequentially  combining  covariance  estimators 
by  adding  one  group  at  a  time,  starting  with  the  complete  observations.  This 
method  is  statistically  efficient  and  is  the  same  as  Anderson’s  maximum  likelihood 
solution  in  the  nested  case.  The  above  methods  all  appear  to  be  valid  approaches 
to  the  multivariate  case.  A  method  very  similar  to  that  of  Hartley  [28]  was 
described  by  Federspiel,  Monroe  and  Greenberg  [24]  (although  used  earlier  by 
Greenberg).  This  is  an  iterative  method  which  involves  replacing  the  missing 
components  by  their  conditional  expectation,  given  the  observed  components. 
Although  computationally  simple,  it  gives  rise  to  biased  estimates  of  the  covari¬ 
ance  matrix  (and,  hence,  of  the  mean),  since  the  score  contains  quadratic 
functions  of  the  observations.  Buck  [14]  used  a  similar  approach  and  also  cor¬ 
rected  for  the  bias  in  the  covariance  matrix,  in  the  case  of  a  single  missing 
component.  However,  he  failed  to  give  the  correct  extension  to  more  than  a 
single  missing  component.  Another  approach  (code  M3),  which  could  give  rise 
to  a  nonpositive  definite  covariance  matrix,  is  the  use  of  all  the  available  data  to 
estimate  each  component  of  the  mean  and  covariance  matrix  separately.  This 
has  been  described  by  Glasser  [25]  and  Haitovsky  [27].  An  extensive  literative 
review  was  given  by  Afifi  and  Elashoff  [2],  who  also  considered  some  of  the 
methods  discussed  here,  as  well  as  many  of  the  methods  in  common  use.  One  can 
determine  from  their  extensive  analysis  that  most  approximations  are  best  “quick 
and  dirty”  and  at  worst  misleading. 

Missing  information  problems,  as  distinct  from  missing  data  problems,  are 
many  and  varied,  some  being  recognized  as  such,  some  not.  Examples  of  the  lack 
of  identifiability  in  mixture  problems  have  been  presented  by  Behboodian  [12] 
for  a  mixture  of  univariate  normal  populations,  by  Wolfe  [49]  for  a  mixture  of 
multivariate  normal  populations,  and  by  Cohen  [16]  for  the  negative  binomial, 
while  the  iterative  methods  used  by  Hartley  [28]  can  deal,  in  addition,  with  the 
problem  of  lost  information  due  to  grouping,  censoring  and  truncating.  The 
problem,  (code  12),  of  obtaining  genotype  frequencies  from  phenotype  fre¬ 
quencies  has  been  dealt  with  by  Ceppellini,  Siniscialco,  and  Smith  [15]. 
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